HW 2
Fall 2020, DSPA (HS650)
Name: Danny Siu
SID: #### - 9281 (last 4 digits only)
UMich E-mail: dsiu@umich.edu
I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
Remember you are allowed and encouraged to discuss, on a conceptual level, the problems with your classmates; however, this cannot involve the exchange of actual code, printouts, solutions, e-mails or other explicit electronic or paper handouts.
Load the following two datasets separately, generate summary statistics for all features, plot some of the features using histograms, box plots, density plots, etc., as appropriate, and save the summaries locally as Text files.
I loaded the data from external download links and sources so that the work is replicable. There are 101 features in ALS_train, but 131 features in ALS_test. The knee pain dataset is in long format, with x and Y coordinates and 4 types of views.
## ID Age_mean Albumin_max Albumin_median Albumin_min Albumin_range ALSFRS_slope
## 1 1 65 57 40.5 38 0.066202091 -0.9656085
## 2 2 48 45 41.0 39 0.010452962 -0.9217172
## 3 3 38 50 47.0 45 0.008928571 -0.9147870
## 4 4 63 47 44.0 41 0.012111135 -0.5983607
## 5 5 63 47 45.5 42 0.008291874 -0.4440389
## 6 6 36 51 47.0 46 0.009057971 -0.1183528
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## 1 30 28.0 22 0.02116402
## 2 37 33.0 21 0.02872531
## 3 24 14.0 10 0.02500000
## 4 30 29.0 24 0.01496259
## 5 32 27.5 20 0.02037351
## 6 37 34.5 27 0.01811594
## ALT.SGPT._max ALT.SGPT._median ALT.SGPT._min ALT.SGPT._range AST.SGOT._max
## 1 24 22.0 18 0.02090592 31
## 2 25 13.0 8 0.02961672 31
## 3 25 20.0 14 0.01964286 24
## 4 62 60.0 41 0.05236908 46
## 5 38 26.5 22 0.02653400 35
## 6 34 23.0 18 0.02898551 31
## AST.SGOT._median AST.SGOT._min AST.SGOT._range Bicarbonate_max
## 1 27.5 23 0.02787456 30
## 2 17.0 14 0.02961672 32
## 3 19.0 18 0.01071429 35
## 4 40.0 33 0.03241895 23
## 5 26.5 20 0.02487562 32
## 6 26.0 21 0.01811594 29
## Bicarbonate_median Bicarbonate_min Bicarbonate_range
## 1 28 25 0.017421603
## 2 28 25 0.012195122
## 3 29 24 0.019642857
## 4 20 20 0.007481297
## 5 28 23 0.014925373
## 6 26 22 0.012681159
## Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
## 1 8.0322 7.11945
## 2 8.3973 4.74630
## 3 5.4765 4.38120
## 4 8.0322 8.03220
## 5 5.1114 4.19865
## 6 6.5718 5.11140
## Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range
## 1 6.5718 0.005088502
## 2 4.0161 0.007632753
## 3 3.6510 0.003259821
## 4 6.5718 0.003641895
## 5 3.6510 0.002421891
## 6 4.0161 0.004629891
## bp_diastolic_max bp_diastolic_median bp_diastolic_min bp_diastolic_range
## 1 90 83 69 0.05555556
## 2 80 78 64 0.02872531
## 3 86 76 58 0.05000000
## 4 90 80 70 0.04987531
## 5 100 80 68 0.05306799
## 6 84 80 60 0.04347826
## bp_systolic_max bp_systolic_median bp_systolic_min bp_systolic_range
## 1 160 139.0 129 0.08201058
## 2 140 132.5 104 0.06463196
## 3 120 110.0 90 0.05357143
## 4 150 130.0 120 0.07481297
## 5 160 130.0 104 0.09286899
## 6 140 115.0 100 0.07246377
## Calcium_max Calcium_median Calcium_min Calcium_range Chloride_max
## 1 2.49500 2.220550 2.22055 0.000956272 109
## 2 2.32035 2.170650 2.02095 0.000521603 108
## 3 2.47005 2.295400 2.19560 0.000490089 108
## 4 2.47005 2.345300 2.23000 0.000473934 109
## 5 2.42015 2.257975 2.17065 0.000413765 107
## 6 2.39520 2.270450 2.17065 0.000406793 110
## Chloride_median Chloride_min Chloride_range Creatinine_max Creatinine_median
## 1 108 103 0.020905923 79.56 79.56
## 2 102 100 0.013937282 61.88 53.04
## 3 106 104 0.007142857 88.40 79.56
## 4 107 106 0.007481297 70.72 61.88
## 5 104 100 0.011608624 61.88 48.62
## 6 105 101 0.016304348 106.08 88.40
## Creatinine_min Creatinine_range Gender_mean Glucose_max Glucose_median
## 1 70.72 0.03080139 1 7.4370 4.4955
## 2 44.20 0.03080139 1 6.7710 4.9950
## 3 70.72 0.03157143 2 5.6610 5.1060
## 4 53.04 0.04408978 2 5.1060 4.7730
## 5 26.52 0.05864013 1 7.4925 5.7165
## 6 70.72 0.06405797 2 5.5500 5.1060
## Glucose_min Glucose_range hands_max hands_median hands_min hands_range
## 1 4.2180 0.011216028 8 7.5 6 0.005291005
## 2 4.0515 0.004737805 8 6.0 6 0.003590664
## 3 4.2180 0.002576786 4 1.0 0 0.007142857
## 4 4.6620 0.001107232 6 5.5 4 0.004987531
## 5 5.0505 0.004049751 8 6.5 3 0.008488964
## 6 4.4400 0.002010870 8 7.0 5 0.005434783
## Hematocrit_max Hematocrit_median Hematocrit_min Hematocrit_range
## 1 44.6 43.15 40.7 0.013588850
## 2 41.9 39.60 37.7 0.007317073
## 3 49.1 46.20 44.0 0.009107143
## 4 46.3 43.00 41.7 0.011471322
## 5 44.0 42.85 39.5 0.007462687
## 6 46.8 43.50 41.9 0.008876812
## Hemoglobin_max Hemoglobin_median Hemoglobin_min Hemoglobin_range leg_max
## 1 156 146.0 143 0.04529617 8
## 2 138 132.0 128 0.01742160 8
## 3 161 154.0 151 0.01785714 4
## 4 154 145.0 144 0.02493766 4
## 5 152 146.5 138 0.02321725 2
## 6 157 146.0 142 0.02717391 8
## leg_median leg_min leg_range mouth_max mouth_median mouth_min mouth_range
## 1 6.5 4 0.010582011 5 3.5 0 0.013227513
## 2 7.5 3 0.008976661 9 8.0 4 0.008976661
## 3 3.0 2 0.003571429 10 7.0 4 0.010714286
## 4 3.5 2 0.004987531 12 12.0 12 0.000000000
## 5 2.0 0 0.003395586 12 12.0 12 0.000000000
## 6 8.0 4 0.007246377 9 8.0 7 0.003623188
## onset_delta_mean onset_site_mean Platelets_max Platelets_median Platelets_min
## 1 -1023 1 172 169.0 152
## 2 -341 1 286 264.0 230
## 3 -1181 1 233 213.0 167
## 4 -365 2 275 233.0 204
## 5 -1768 2 313 283.5 268
## 6 -334 1 220 194.0 178
## Potassium_max Potassium_median Potassium_min Potassium_range pulse_max
## 1 4.5 4.25 4.0 0.001742160 79
## 2 5.0 4.30 3.9 0.001916376 90
## 3 4.1 4.00 3.9 0.000357143 82
## 4 4.3 4.20 4.0 0.000748130 84
## 5 4.6 3.75 3.5 0.001824212 101
## 6 4.5 4.30 4.2 0.000543478 88
## pulse_median pulse_min pulse_range respiratory_max respiratory_median
## 1 68 61 0.04761905 4 3
## 2 76 64 0.04667864 4 4
## 3 73 60 0.03928571 4 4
## 4 72 68 0.03990025 3 3
## 5 96 74 0.04477612 4 4
## 6 66 60 0.05072464 4 4
## respiratory_min respiratory_range Sodium_max Sodium_median Sodium_min
## 1 3 0.002645503 148 145.5 143
## 2 3 0.001795332 142 138.0 136
## 3 4 0.000000000 145 143.0 140
## 4 3 0.000000000 143 139.0 138
## 5 3 0.001697793 143 140.0 138
## 6 3 0.001811594 145 141.0 137
## Sodium_range SubjectID trunk_max trunk_median trunk_min trunk_range
## 1 0.017421603 533 8 7 7 0.002645503
## 2 0.010452962 649 8 7 5 0.005385996
## 3 0.008928571 1234 5 0 0 0.008928571
## 4 0.012468828 2492 5 5 3 0.004987531
## 5 0.008291874 2956 6 4 1 0.008488964
## 6 0.014492754 3085 8 8 7 0.001811594
## Urine.Ph_max Urine.Ph_median Urine.Ph_min
## 1 6 6 6
## 2 7 5 5
## 3 6 5 5
## 4 7 6 5
## 5 6 5 5
## 6 8 6 5
## x Y View
## 1 11 73 RF
## 2 20 88 RF
## 3 19 73 RF
## 4 15 65 RF
## 5 21 57 RF
## 6 26 101 RF
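The loading and saving steps are not shown in the rendered output above. A minimal sketch of one way to save the summaries as text files follows. It assumes the data would be read with read.csv() from a download link (the data_url name is a placeholder, not the actual source) and uses a built-in dataset as a stand-in so that the snippet runs on its own.

```r
# In the homework the data frame would come from a download link, e.g.
#   df <- read.csv(data_url)   # data_url is a placeholder for the real link
# Here a built-in dataset stands in so the sketch is self-contained.
df <- datasets::mtcars

# summary() on a data frame reports per-column summary statistics;
# capture.output() turns the printed result into a character vector.
summary_text <- capture.output(summary(df))

# Save the printed summary as a plain text file (the file name is arbitrary).
out_file <- file.path(tempdir(), "summary_stand_in.txt")
writeLines(summary_text, out_file)
```

The same three lines can be repeated for each dataset, changing only the data frame and the output file name.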
Next, I want to compute summary statistics for all of the features. First, I notice that the test dataset for ALS has more features than the training dataset. Let’s find out which columns differ:
## [1] "Basophils_max" "Basophils_median"
## [3] "Basophils_min" "Basophils_range"
## [5] "Bilirubin..total._max" "Bilirubin..total._median"
## [7] "Bilirubin..total._min" "Bilirubin..total._range"
## [9] "BMI_max" "Eosinophils_max"
## [11] "Eosinophils_median" "Eosinophils_min"
## [13] "Eosinophils_range" "Lymphocytes_max"
## [15] "Lymphocytes_median" "Lymphocytes_min"
## [17] "Lymphocytes_range" "Monocytes_max"
## [19] "Monocytes_median" "Monocytes_min"
## [21] "Monocytes_range" "Red.Blood.Cells..RBC._max"
## [23] "Red.Blood.Cells..RBC._median" "Red.Blood.Cells..RBC._min"
## [25] "Red.Blood.Cells..RBC._range" "Urine.Ph_range"
## [27] "White.Blood.Cell..WBC._max" "White.Blood.Cell..WBC._median"
## [29] "White.Blood.Cell..WBC._min" "White.Blood.Cell..WBC._range"
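The code that produced this list is not shown. A minimal sketch of how such a column comparison could be done with setdiff() follows; the two toy data frames are stand-ins for ALS_train and ALS_test, and only their column names matter.

```r
# Toy stand-ins for ALS_train and ALS_test (values are made up).
train_toy <- data.frame(ID = 1:3, Age_mean = c(65, 48, 38))
test_toy  <- data.frame(ID = 1:3, Age_mean = c(50, 61, 47), BMI_max = c(24, 27, 22))

# Columns present in the test set but absent from the training set.
extra_in_test <- setdiff(names(test_toy), names(train_toy))

# Columns present in the training set but absent from the test set.
extra_in_train <- setdiff(names(train_toy), names(test_toy))

print(extra_in_test)
```

Checking both directions is worthwhile, since setdiff() is not symmetric: a column missing from the test set would not show up in the first call.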
# Summarise the mean, sd, and median. Can add other summary statistics
sum_df = rbind(round(ALS_train %>% summarise_all("mean"),3),
round(ALS_train %>% summarise_all("sd"),3),
round(ALS_train %>% summarise_all("median"),3))
rownames(sum_df) <- c("mean","sd","median")
head(t(sum_df),30) # Display first 30
## mean sd median
## ID 1214.875 696.678 1213.000
## Age_mean 54.550 11.397 55.000
## Albumin_max 47.011 3.234 47.000
## Albumin_median 43.953 2.655 44.000
## Albumin_min 40.766 3.193 41.000
## Albumin_range 0.014 0.010 0.012
## ALSFRS_slope -0.728 0.622 -0.621
## ALSFRS_Total_max 31.692 5.314 33.000
## ALSFRS_Total_median 27.105 6.634 28.000
## ALSFRS_Total_min 19.877 8.584 20.000
## ALSFRS_Total_range 0.026 0.016 0.023
## ALT.SGPT._max 54.436 44.830 45.000
## ALT.SGPT._median 32.993 15.602 30.000
## ALT.SGPT._min 23.015 11.231 21.000
## ALT.SGPT._range 0.071 0.111 0.048
## AST.SGOT._max 43.128 35.289 38.000
## AST.SGOT._median 29.077 9.594 27.000
## AST.SGOT._min 21.542 7.395 20.000
## AST.SGOT._range 0.049 0.084 0.035
## Bicarbonate_max 30.897 3.164 31.000
## Bicarbonate_median 26.964 2.199 27.000
## Bicarbonate_min 23.164 2.409 23.000
## Bicarbonate_range 0.017 0.011 0.015
## Blood.Urea.Nitrogen..BUN._max 7.353 2.320 6.937
## Blood.Urea.Nitrogen..BUN._median 5.558 1.335 5.423
## Blood.Urea.Nitrogen..BUN._min 4.161 1.354 4.070
## Blood.Urea.Nitrogen..BUN._range 0.007 0.005 0.006
## bp_diastolic_max 92.031 8.758 90.000
## bp_diastolic_median 81.113 7.246 80.000
## bp_diastolic_min 69.891 8.444 70.000
## [1] 0
## x Y View
## Min. : 11.0 Min. : 34.0 Length:8666
## 1st Qu.: 95.0 1st Qu.:192.0 Class :character
## Median :200.0 Median :210.0 Mode :character
## Mean :224.4 Mean :210.2
## 3rd Qu.:241.0 3rd Qu.:226.0
## Max. :642.0 Max. :380.0
## knee_df$View: LB
## x Y View
## Min. :384.0 Min. : 64.0 Length:924
## 1st Qu.:426.0 1st Qu.:186.0 Class :character
## Median :444.0 Median :194.0 Mode :character
## Mean :443.0 Mean :201.8
## 3rd Qu.:452.2 3rd Qu.:209.0
## Max. :498.0 Max. :374.0
## ------------------------------------------------------------
## knee_df$View: LF
## x Y View
## Min. :164.0 Min. : 49.0 Length:3369
## 1st Qu.:200.0 1st Qu.:197.0 Class :character
## Median :211.0 Median :213.0 Mode :character
## Mean :211.4 Mean :212.2
## 3rd Qu.:223.0 3rd Qu.:228.0
## Max. :287.0 Max. :368.0
## ------------------------------------------------------------
## knee_df$View: RB
## x Y View
## Min. :520.0 Min. : 34.0 Length:882
## 1st Qu.:563.0 1st Qu.:186.0 Class :character
## Median :572.0 Median :194.0 Mode :character
## Mean :572.7 Mean :201.4
## 3rd Qu.:590.0 3rd Qu.:209.0
## Max. :642.0 Max. :366.0
## ------------------------------------------------------------
## knee_df$View: RF
## x Y View
## Min. : 11.00 Min. : 57.0 Length:3491
## 1st Qu.: 80.00 1st Qu.:198.0 Class :character
## Median : 91.00 Median :213.0 Mode :character
## Mean : 91.06 Mean :212.6
## 3rd Qu.:104.50 3rd Qu.:227.0
## Max. :137.00 Max. :380.0
Next, plot the first 10 features using histograms, box plots, and density plots. Then, finally, make a heatmap of all the features.
# pairs(ALS_10[,2:11])
# heatmap(as.matrix(ALS_10[,2:11]))
# corrplot(cor(ALS_10[,2:11]), method="ellipse")
## Knee Data
## Based on these plots we can infer that X is the changing position here
g1 <- ggplot(knee_df,aes(View,x)) + geom_boxplot()
g2 <- ggplot(knee_df,aes(View,Y)) + geom_boxplot()
ggarrange(g1,g2)
Use ALS case-study data and SOCR Knee Pain Data to explore some bivariate relations (e.g. bivariate plot, correlation, table, crosstable, etc.). Use 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relations between temperature and time. [Hint: use geom_line or geom_bar]. Some sample code is included below.
First let’s look at ALS bivariate relationships
##
## (37,40.3] (40.3,43.7] (43.7,47] (47,50.3] (50.3,53.6] (53.6,57]
## (17.9,30.6] 0 0 4 19 9 2
## (30.6,43.2] 4 21 70 222 71 7
## (43.2,55.8] 23 51 192 355 73 12
## (55.8,68.4] 31 81 283 372 40 3
## (68.4,81.1] 21 43 95 95 9 0
##
## (57,60.3] (60.3,63.6] (63.6,67] (67,70.3]
## (17.9,30.6] 0 0 0 0
## (30.6,43.2] 0 0 1 0
## (43.2,55.8] 1 0 0 0
## (55.8,68.4] 5 2 0 2
## (68.4,81.1] 1 2 1 0
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2223
##
##
## | cut(ALS_train$Glucose_max, 4)
## as.factor(ALS_train$Gender_mean) | (4.13,11.5] | (11.5,18.9] | (18.9,26.3] | (26.3,33.7] | Row Total |
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
## 1 | 779 | 24 | 2 | 1 | 806 |
## | 0.139 | 1.234 | 2.174 | 0.364 | |
## | 0.967 | 0.030 | 0.002 | 0.001 | 0.363 |
## | 0.367 | 0.289 | 0.133 | 0.200 | |
## | 0.350 | 0.011 | 0.001 | 0.000 | |
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
## 2 | 1341 | 59 | 13 | 4 | 1417 |
## | 0.079 | 0.702 | 1.237 | 0.207 | |
## | 0.946 | 0.042 | 0.009 | 0.003 | 0.637 |
## | 0.633 | 0.711 | 0.867 | 0.800 | |
## | 0.603 | 0.027 | 0.006 | 0.002 | |
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
## Column Total | 2120 | 83 | 15 | 5 | 2223 |
## | 0.954 | 0.037 | 0.007 | 0.002 | |
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
##
##
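The two tables above cross-tabulate binned numeric variables, but the calls that built them are not shown. A minimal sketch of how such tables can be produced with cut() and table() follows, using synthetic values (the variable names and numbers are illustrative and are not ALS data).

```r
set.seed(1)
# Synthetic stand-ins for two numeric features and one two-level group.
a <- runif(200, min = 20, max = 80)
b <- runif(200, min = 35, max = 70)
g <- sample(c(1, 2), size = 200, replace = TRUE)

# cut() bins a numeric vector into equal-width intervals;
# table() then counts how many observations fall in each pair of bins.
tab_ab <- table(cut(a, 5), cut(b, 10))

# Group-by-bin table with row proportions, similar in spirit to the
# "N / Row Total" line of the cross-table above.
tab_gb <- table(as.factor(g), cut(b, 4))
row_props <- prop.table(tab_gb, margin = 1)
print(round(row_props, 3))
```

Equal-width bins leave the upper intervals nearly empty when a variable is right-skewed, as Glucose_max is above; quantile-based breaks are one alternative.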
Then, let’s look at knee data
# 2D density of pain locations per view, each panel on its own axes
g1 <- ggplot(knee_df,aes(x,Y)) +
  stat_density2d(geom="polygon", aes(alpha=..level..),
                 fill="blue",linetype=2)+theme_bw() +
  facet_wrap(~View,scales="free") + ggtitle("Free axes") + theme(plot.title=element_text(hjust=0.5))
# Same plot with shared axes, so the views can be compared by position
g2 <- ggplot(knee_df,aes(x,Y)) +
  stat_density2d(geom="polygon", aes(alpha=..level..),
                 fill="blue",linetype=2)+theme_bw() +
  facet_wrap(~View) + ggtitle("Same axes") + theme(plot.title=element_text(hjust=0.5))
ggarrange(g1,g2, common.legend=TRUE)
## # A tibble: 4 x 2
## View c
## <chr> <dbl>
## 1 LB 0.154
## 2 LF -0.107
## 3 RB -0.116
## 4 RF 0.0538
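The table above reports one correlation per view, and all four are weak (absolute value below 0.16). The tibble output suggests a dplyr group_by()/summarise() call was used; the sketch below shows the same idea in base R on toy data, so the column values are made up and only the pattern carries over.

```r
# Toy long-format data: two views with x/Y coordinates (values are made up).
toy <- data.frame(
  x    = c(1, 2, 3, 4, 10, 11, 12, 13),
  Y    = c(2, 4, 6, 8,  9,  7,  5,  3),
  View = rep(c("A", "B"), each = 4)
)

# Split the rows by view, then compute the Pearson correlation of x and Y
# within each subset.
cor_by_view <- sapply(split(toy, toy$View), function(d) cor(d$x, d$Y))
print(cor_by_view)
```

Computing the correlation within each view matters here because the views occupy different x ranges, so a pooled correlation would mostly reflect the layout of the views rather than any relation inside them.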
Now let’s take a look at the temperature data for the past century
There seems to be evidence that the temperature in Ann Arbor is trending upward. This is most apparent in February and June from the 1900s to the 1950s. Putting every month on the same scale, as in a bar plot, makes it much more difficult to discern any differences.
## Year Jan Feb Mar Apr
## Min. :1900 Min. :11.40 Min. :14.00 Min. :24.70 Min. :37.50
## 1st Qu.:1929 1st Qu.:20.95 1st Qu.:22.35 1st Qu.:32.65 1st Qu.:45.15
## Median :1958 Median :24.20 Median :26.10 Median :35.30 Median :47.60
## Mean :1958 Mean :24.22 Mean :25.45 Mean :35.39 Mean :47.55
## 3rd Qu.:1986 3rd Qu.:27.65 3rd Qu.:28.90 3rd Qu.:38.15 3rd Qu.:50.15
## Max. :2015 Max. :35.00 Max. :35.60 Max. :50.70 Max. :54.50
## NA's :1 NA's :1 NA's :1 NA's :1
## May Jun Jul Aug
## Min. :49.40 Min. :61.90 Min. :67.80 Min. :64.90
## 1st Qu.:56.60 1st Qu.:66.45 1st Qu.:71.10 1st Qu.:68.85
## Median :58.70 Median :68.40 Median :72.60 Median :70.90
## Mean :58.85 Mean :68.22 Mean :72.66 Mean :70.74
## 3rd Qu.:61.65 3rd Qu.:70.50 3rd Qu.:73.70 3rd Qu.:72.40
## Max. :66.60 Max. :74.30 Max. :78.90 Max. :76.20
## NA's :1 NA's :1
## Sep Oct Nov Dec
## Min. :54.80 Min. :42.70 Min. :32.90 Min. :17.70
## 1st Qu.:62.20 1st Qu.:50.20 1st Qu.:37.40 1st Qu.:25.55
## Median :63.80 Median :52.20 Median :39.80 Median :28.40
## Mean :63.82 Mean :52.29 Mean :39.76 Mean :28.39
## 3rd Qu.:65.75 3rd Qu.:54.30 3rd Qu.:41.85 3rd Qu.:31.65
## Max. :68.80 Max. :62.10 Max. :47.50 Max. :36.80
## NA's :1 NA's :1 NA's :1 NA's :1
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 2015 26.3 14.4 34.9 49.0 64.2 68.0 71.2 70.2 68.7 53.9 39.8 28.4
## 2 2014 24.4 19.4 29.0 48.9 60.7 69.7 68.8 70.8 63.2 52.1 35.4 33.3
## 3 2013 22.7 26.1 33.3 46.0 63.1 68.5 72.9 70.2 64.6 53.2 37.6 26.7
## 4 2012 22.4 32.8 50.7 49.2 65.2 71.4 78.9 72.2 63.9 51.7 39.6 34.8
## 5 2011 15.3 24.2 33.1 45.5 58.1 68.7 78.7 70.8 62.3 52.2 44.8 34.1
## 6 2010 18.4 27.4 42.0 54.5 61.9 70.5 75.3 74.3 64.2 55.0 41.4 25.2
To reshape the data from wide to long, reshape() needs:
the list of variable names that define the different times or metrics (varying): here, the month columns stored in colN;
the name we wish to give the variable containing these values in our long dataset (v.names): here, Temps;
the name we wish to give the variable describing the different times or metrics (timevar): here, Months;
the values this variable will have (times): here, the column names in colN; and
the end format for the data (direction): here, "long", since we go from wide to long.
# Before reshaping make sure all data types are the same as putting them in 1 column will
# otherwise generate inconsistencies/errors
longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar="Months", times = colN, direction = "long")
# View(longTempData)
bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
geom_bar(stat = "identity", position="dodge")
print(bar2)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
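The reshape() call above can be illustrated on a small wide table. This is a sketch only: the two rows reuse the 2014 and 2015 January and February values from the table above, and month_cols plays the role that colN plays in the original code (its definition is not shown, so that correspondence is an assumption).

```r
# Small wide table: one row per year, one column per month.
wide <- data.frame(Year = c(2014, 2015),
                   Jan  = c(24.4, 26.3),
                   Feb  = c(19.4, 14.4))

# Names of the columns that will be stacked into one long column.
month_cols <- c("Jan", "Feb")

long <- reshape(wide,
                varying   = month_cols,  # columns to stack
                v.names   = "Temps",     # name of the stacked value column
                timevar   = "Months",    # name of the new label column
                times     = month_cols,  # labels written into Months
                direction = "long")
print(long)
```

The result has one row per year and month combination, which is the layout ggplot() expects when Months is mapped to an aesthetic.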
Introduce (artificially) some missing data in the Knee Pain dataset, impute the missing values and examine the differences between the original, incomplete, and imputed datasets.
The plots show the histogram of the complete variable (in black), the histogram of the observed values (in blue), and the histogram of the imputed values (in red). The imputations perform quite well, judging by how closely the blue trace follows the black histogram. With 5 chains and 15% missing data, the convergence statistics reported below are roughly near 1 (about 0.88 to 1.28), which suggests a reasonable imputation. I do find it odd, though, that all the chains yield very similar values for x and Y.
# create MCAR missing-data generator
create.missing <- function (data, pct.mis = 10)
{
  n <- nrow(data)
  J <- ncol(data)
  if (length(pct.mis) == 1) {
    if (pct.mis >= 0 && pct.mis <= 100) {
      # same number of missing cells in every column
      n.mis <- rep(round(n * (pct.mis/100)), J)
    }
    else {
      # stop() replaces warning() + break: break is not valid outside a loop
      stop("Percent missing values should be a number between 0 and 100! Exiting")
    }
  }
  else {
    if (length(pct.mis) != J)
      stop("The length of the missing-vector is not equal to the number of columns in the data! Exiting!")
    n.mis <- round(n * (pct.mis/100))
  }
  for (i in seq_len(J)) {
    if (n.mis[i] > 0) { # columns with nothing to remove are left unchanged
      # For each given column (i), sample the row indices (1:n),
      # a number of indices to replace as "missing", n.mis[i], "NA",
      # without replacement
      data[sample(1:n, n.mis[i], replace = FALSE), i] <- NA
    }
  }
  return(as.data.frame(data))
}
knee_missing<- cbind(create.missing(knee_df[,1:2], pct.mis=15), knee_df[3])
# datatable(knee_missing[1:10,])
mdf <- missing_data.frame(knee_missing)
image(mdf)
## null device
## 1
## chain:1 chain:2 chain:3 chain:4 chain:5
## x 0.001 0.001 0.001 0.002 0.002
## Y -0.001 -0.002 0.004 0.000 0.007
## View 2.801 2.801 2.801 2.801 2.801
## missing_x 0.150 0.150 0.150 0.150 0.150
## missing_Y 0.150 0.150 0.150 0.150 0.150
## mean_x mean_Y sd_x sd_Y
## 1.2812240 1.0978629 0.8844699 0.9565299
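The comparison among the original, incomplete, and imputed data can also be checked numerically. The sketch below uses synthetic data and simple mean imputation as a stand-in; it is not the model-based imputation run above, only an illustration of the three-way comparison.

```r
set.seed(1234)
# Synthetic stand-in for one coordinate column (not the knee data).
original <- rnorm(1000, mean = 200, sd = 40)

# Remove 15% of the values completely at random (MCAR).
incomplete <- original
miss_idx <- sample(seq_along(original), size = round(0.15 * length(original)))
incomplete[miss_idx] <- NA

# Stand-in imputation: fill each gap with the observed mean.
imputed <- incomplete
imputed[is.na(imputed)] <- mean(incomplete, na.rm = TRUE)

# Compare location and spread across the three versions.
comparison <- rbind(
  original   = c(mean = mean(original), sd = sd(original)),
  incomplete = c(mean = mean(incomplete, na.rm = TRUE),
                 sd   = sd(incomplete, na.rm = TRUE)),
  imputed    = c(mean = mean(imputed), sd = sd(imputed))
)
print(round(comparison, 2))
```

Mean imputation keeps the observed mean but shrinks the standard deviation, which is one reason a model-based multiple imputation is usually preferred over a single fill-in value.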
Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of locations of the patient reported knee pain (use plot_ly and kernel density estimation).
Surface plots are an interesting way to visualize the data. I think they would not be advantageous for time-series data or very dense datasets, since peaks may be missed. Humans are also bad at assessing 3D representations, so a surface plot may be more confusing than a simple heatmap. With this colormap, the peaks are yellow and the troughs are blue. I also found that plot_ly can be called more simply by passing “z=kd$z” directly instead of wrapping the call in with(), which adds a layer of complexity. Experimenting with colors may also be necessary, since some colormaps (like jet) do not represent the data faithfully.
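A minimal sketch of the kernel density and surface plot steps follows. It assumes MASS::kde2d() for the density estimate and uses synthetic coordinates in place of the RF knee data; the plotly call is guarded so the snippet still runs where plotly is not installed.

```r
library(MASS)  # provides kde2d()

set.seed(1)
# Synthetic stand-in for the RF pain locations (not the real data).
x <- rnorm(500, mean = 90, sd = 15)
y <- rnorm(500, mean = 210, sd = 25)

# 2D kernel density estimate evaluated on a 50 x 50 grid.
kd <- kde2d(x, y, n = 50)

# Build the surface only if plotly is available.
if (requireNamespace("plotly", quietly = TRUE)) {
  # plotly's surface trace reads z[i, j] at (x[j], y[i]), while kde2d()
  # returns z[i, j] at (x[i], y[j]); hence the transpose.
  p <- plotly::plot_ly(x = kd$x, y = kd$y, z = t(kd$z), type = "surface")
  # Typing p in an interactive session displays the plot.
}
```

With a square grid, passing kd$z without the transpose still draws a surface, but the x and y axes are swapped, which is easy to miss when the two coordinates have similar ranges.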
Rebalance the groups of ALS (training data) patients according
It took a bit of playing around with the inputs for SMOTE and ubBalance. In particular, the perc.over and perc.under arguments and their specific proportions took me a while to understand. I found myself experimenting with the proportions until the two groups had equal numbers of samples. Specifically, over-sampling the minority class by 100% (perc.over = 100) and setting perc.under = 200, so that two majority cases are drawn for every synthetic minority case, yields the same N in both groups.
SMOTE is preferable to simply downsampling the majority class, since downsampling only decreases our N. Instead, SMOTE artificially generates new examples of the minority class using the nearest neighbors of these cases. It normally also undersamples the majority class, leading to a more balanced dataset. In this case, it instead oversampled the majority class by approximately 10% (from 1435 to 1576). Getting ubBalance to work required going into the source code and finding out that the positive argument must name the minority class.
set.seed(1234)
# Note: factor() orders the levels as 0, 1, so as written the label "Young" is
# attached to 0 (Age_mean > 50) and "Old" to 1 (Age_mean <= 50); swap the labels
# if the intent is Young = Age_mean <= 50.
ALS_10$Class <- ifelse(ALS_10$Age_mean <= 50, 1,0)
ALS_10$Class <- factor(ALS_10$Class, labels=c("Young", "Old"))
# Using the DMwR package:
CC_balancedClasses <- SMOTE(Class ~ ., data = ALS_10, perc.over = 100, perc.under=200)
# Alternatively, using the unbalanced package:
input <- ALS_10[ , -which(names(ALS_10) %in% c("Class"))]; output <- as.factor(ALS_10$Class)
Rebalanced_CC_data <- ubBalance(X=input, Y=output, type="ubSMOTE",positive="Old", percOver=100, percUnder=200, verbose=TRUE)
## Proportion of positives after ubSMOTE : 50 % of 3152 observations
##
## Young Old
## 1435 788
##
## Young Old
## 1576 1576
##
## Young Old
## 1576 1576
counts <- rbind(table(ALS_10$Class),table(Rebalanced_CC_data$Y))
# counts <- CC_balancedClasses
barplot(t(counts), main="Before and After Rebalancing", names.arg=c("Before", "After"),
xlab="Raw vs. Rebalanced Data",ylab="Number of cases", col=c("orange","darkgreen"),
legend = c("Young", "Old"), beside=TRUE) # Mind the transposition of the counts table.
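The group sizes in the tables above (1435 and 788 before, 1576 each after) follow from simple arithmetic on perc.over and perc.under. The sketch below works through it, assuming the sizing rule described in the DMwR documentation for SMOTE(): perc.over sets how many synthetic cases are generated per minority case, and perc.under sets how many majority cases are drawn per synthetic case. The variable names are illustrative.

```r
n_minority <- 788    # "Old" count before rebalancing (from the table above)
n_majority <- 1435   # "Young" count before rebalancing
perc_over  <- 100
perc_under <- 200

# Synthetic minority cases: perc.over/100 per original minority case.
n_synthetic <- (perc_over / 100) * n_minority

# Minority total = original cases + synthetic cases.
n_minority_after <- n_minority + n_synthetic

# Majority cases drawn (with replacement): perc.under/100 per synthetic case.
n_majority_after <- (perc_under / 100) * n_synthetic

print(c(minority = n_minority_after, majority = n_majority_after))

# Relative growth of the majority class.
majority_growth <- n_majority_after / n_majority - 1
```

Under this rule the two groups come out equal whenever perc.under equals 100 * (1 + 100 / perc.over), which is 200 for perc.over = 100.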